In [2]:
# this is a single line comment
print "trying out some comments In class"
"""and here
is a multi-line comment"""
print "python ftw"
Variables are aliases for data. This allows the developer to use a name for a particular value rather than the value itself. This makes the code more readable, and lets a program store a result once and reuse it instead of recomputing it.
In python, a variable can be named almost anything, according to the whims of the programmer. Names may contain any letter, any digit, and the underscore "_", provided the name does not start with a digit. Whitespace and characters with special meanings in Python, such as "+" and "-", are not allowed. Variable names are case-sensitive. The common pattern is to separate words in variable names with underscores "_".
Variables are declared by stating the variable name and assigning to it using the "=" operator. At any time, you can reassign a value to a variable.
In [3]:
a_variable = "yay!"
another_variable = "woo!"
print a_variable
a_variable = "uh oh!" # reassigning
print a_variable
a_variable = another_variable # reassigning again
print a_variable
In [3]:
str_1 = "hello"
str_2 = "world"
print str_1 + " " + str_2 + "!" # note that + concatenates strings
str_3 = """
this is a multiline
string!
"""
print str_3
In [4]:
int_1 = 1
int_2 = -2
int_3 = 100
print int_1 + int_2 + int_3 # for integers, + is just "plus"
In [5]:
float_1 = 1.2
float_2 = -4.0
float_3 = 10.0
print float_1 + float_2 + float_3
In [6]:
bool_1 = True
bool_2 = False
print bool_1
print bool_2
Python provides a variety of operations for performing common tasks on the primitive data types presented above. Of course, this list isn't complete, and the core functionality provided by python is greatly extended by library code, some of which will be discussed below. Note that operations can be performed either on "literal primitives" or on variables storing some primitive.
We've already seen one of the most common string operators, +, used for string concatenation. Below are some of the more commonly used string operations:
+
: concatenate two strings
len(str)
: length of a string, the number of characters
str.upper()
: returns an uppercase version of a string
str.lower()
: returns a lowercase version of a string
haystack.index(needle)
: searches haystack for needle and returns the position of the first occurrence, indexed from 0. Note that this raises an error if needle isn't found.
str_1.count(str_2)
: counts the number of occurrences of one string in another.
haystack.startswith(needle)
: does the haystack string start with the needle string?
haystack.endswith(needle)
: does the haystack string end with the needle string?
str_1.split(str_2)
: splits the first string at every occurrence of the second string. Outputs a list (see below).
==
: are the two operand strings the same?
str.strip()
: removes any whitespace from the left or right of the string, including newlines.
A more complete list of string operations is available in the Python documentation.
In [7]:
print "concatenation:"
print str_1 + " " + str_2
print str_1 + " everybody"
print "length:"
print len(str_1)
print len(str_1 + " " + str_2)
print "string casing:"
print str_1.upper()
print "HELLO".lower()
print "string indexing:"
print "hello".index("ll")
print "hello".upper().index("LL")
print "string count:"
print str_1.count("l")
print str_1.count("ll")
print "starts with & endswith:"
print "hello".startswith("he")
print "hello".endswith("world")
print "split:"
print "practical data science".split(" ")
print "hello".split(" ")
print "equality:"
print str_1 == "hello"
print str_1 == "HELLO"
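Two of the string operations described in this section, strip() and the error raised by index(), are not exercised in the cell above. A small sketch, assuming a made-up input line (print is written with parentheses so it behaves the same in Python 2 and 3):

```python
# A line as it might come out of a file: padded with spaces and a newline.
raw_line = "  practical data science \n"
cleaned = raw_line.strip()          # drops whitespace on both ends
print(repr(raw_line))
print(repr(cleaned))

# strip() then split() is a common way to turn a line into words.
words = cleaned.split(" ")
print(words)

# index() raises a ValueError when the needle is missing.
not_found = False
try:
    "hello".index("z")
except ValueError:
    not_found = True
    print("'z' is not in 'hello'")
```

If you only want to know whether a substring is present, count() or the in keyword avoids the error entirely.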
There are a bunch of common mathematical operations available on numeric types in python. If an operation is being performed on two integers, then the output will also be an integer. If one of the operands is a float, then the remaining operand will be cast into a float, and the result will likewise be a float.
+
: plus, add two numbers
-
: minus, subtract two numbers. If put before a single numeric value, takes the negative of that value.
*
: multiply two numbers
/
: divide the first operand by the second.
%
: modulus, what is the remainder when the first number is divided by the second?
In [8]:
print "addition:"
print 1+1
print 1.5 + 1
print "subtraction:"
print 1-1
print 1-1.0
print "negation:"
x = 5
print -x
print "multiplication:"
print 3*3
print 2*2.0
print "division:"
print 5/2 # integer division!
print 5/2.0
print "modulus:"
print 5%2
print 5.5%2
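The integer division shown above is easy to trip over. One way to be explicit about which kind of division you want is sketched below; the // operator is not covered elsewhere in these notes, so treat this as a side note. It behaves the same in Python 2 and 3:

```python
# Floor division always rounds down to a whole number.
floor_result = 5 // 2
print(floor_result)

# Making one operand a float gives the fractional result.
true_result = 5 / 2.0
same_result = float(5) / 2
print(true_result)
print(same_result)

# Note that floor division rounds toward negative infinity, not toward zero.
negative_floor = -5 // 2
print(negative_floor)
```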
There are also a bunch of comparison operators on numeric values:
==
: equality of values
<
: less than
<=
: less than or equal to
>
: greater than
>=
: greater than or equal to
!=
: not equal to, different than
These all return a boolean with a value that depends on the outcome of the comparison.
In [9]:
print "equals:"
print 1 == 2
print 1 == 1
print 1 == 1.0
print "comparison:"
print 1 > 0
print 1 > 1
print 1.0 > 1
print 1 >= 1
print 1.0 != 1
Frequently, one wants to combine or modify boolean values. Python has several operations for just this purpose:
not a
: returns the opposite value of a.
a and b
: returns true if and only if both a and b are true.
a or b
: returns true if either a or b is true, or both.
Like mathematical expressions, boolean expressions can be nested using parentheses.
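To make these operators concrete, here is a short sketch. The variable names are hypothetical, chosen only for illustration:

```python
is_member = True
has_coupon = False

negated = not is_member                 # opposite of is_member
both = is_member and has_coupon         # true only if both are true
either = is_member or has_coupon        # true if at least one is true
print(negated)
print(both)
print(either)

# Parentheses group sub-expressions, just as in arithmetic.
exactly_one = (is_member and not has_coupon) or (has_coupon and not is_member)
print(exactly_one)
```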
Often one wants to embed other information into strings, sometimes with special formatting constraints. In python, one may insert special formatting characters into strings that convey what type of data should be inserted and where, and how the "stringified" form should be formatted. For instance, one may wish to insert an integer into a string:
In [10]:
print "To be or not %d be" % 2
Note the %d formatting (or conversion) specifier in the string. This states that you wish to insert an integer value (more on these conversion specifiers below). The value you wish to insert is then placed after the string, separated from it by a % character. If you wish to insert more than one value into the string being formatted, they can be placed in a comma-separated list, surrounded by parentheses, after the %:
In [11]:
print "%d be or not %d be" % (2, 2)
In detail, a conversion specifier contains two or more characters, which must occur in the following order:
The % character, which marks the start of the specifier.
Optional components such as a minimum field width and a precision; the precision is given as a "." followed by the number of digits of precision.
A conversion type character, such as those listed below.
For a more detailed treatment of string formatting options, see the Python documentation.
Some common conversion type characters are:
d
: Signed integer decimal.
i
: Signed integer decimal.
e
: Floating point exponential format (lowercase).
E
: Floating point exponential format (uppercase).
f
: Floating point decimal format.
c
: Single character (accepts integer or single character string).
r
: String (converts any python object using repr()).
s
: String (converts any python object using str()).
In [12]:
print "%d %s or not %04.1f %c" % (2, "be", 2, 'b')
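The width and precision parts of a specifier are easiest to understand by trying a few. The following sketch uses arbitrary values picked for illustration, and stores each result in a variable so you can inspect it:

```python
price = 3.14159

two_decimals = "%.2f" % price            # precision of 2 digits
padded = "%8.3f" % price                 # minimum width 8, padded with spaces
zero_padded = "%05d" % 42                # minimum width 5, padded with zeros
str_vs_repr = "%s vs %r" % ("hi", "hi")  # str() versus repr() of the same value
exponent = "%e" % 1234.5                 # exponential format

print(two_decimals)
print(padded)
print(zero_padded)
print(str_vs_repr)
print(exponent)
```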
We have covered in detail much of the basics of python's primitive data types. It's now useful to consider how these basic types can be collected in ways that are meaningful and useful for a variety of tasks. Data structures are a fundamental component of programming: collections of data elements that adhere to certain properties, depending on the type. In these notes, we'll present three basic data structures: the list, the set, and the dictionary. Python data structures are very rich, and beyond the scope of this simple primer. Please see the documentation for a more complete view.
A list, sometimes called an array or a vector, is an ordered collection of values. The value of a particular element in a list is retrieved by querying for a specific index into the list. Lists allow duplicate values, but indices are unique. In python, like most programming languages, list indices start at 0; that is, to get the first element in a list, request the element at index 0. Lists provide very fast access to elements at specific positions, but are inefficient at "membership queries," determining if an element is in the list.
In python, lists are specified by square brackets, [ ], containing zero or more values, separated by commas. Lists are the most common data structure, and are often generated as a result of other functions, for instance, a_string.split(" ").
To query a specific value from a list, pass in the requested index into square brackets following the name of the list. Negative indices can be used to traverse the list from the right.
In [13]:
a_list = [1, 2, 3]
another_list = ["a", "b", "c"]
empty_list = []
mixed_list = [1, "a"]
print another_list[1]
print a_list[-1] # indexing from the right
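Beyond indexing, lists can be changed in place. Here is a rough sketch that walks one list through the most common list methods. The list name and values are invented for illustration, and the comments show the list after each step:

```python
scores = [3, 1, 2]
scores.append(5)              # [3, 1, 2, 5]
scores.extend([4, 1])         # [3, 1, 2, 5, 4, 1]
scores.insert(0, 9)           # [9, 3, 1, 2, 5, 4, 1]
removed = scores.pop(0)       # removes and returns 9
position = scores.index(2)    # position of the first 2
ones = scores.count(1)        # how many times 1 appears
scores.sort()                 # [1, 1, 2, 3, 4, 5]
scores.reverse()              # [5, 4, 3, 2, 1, 1]
print(removed)
print(position)
print(ones)
print(scores)
```

Note that append, extend, insert, sort, and reverse all modify the list itself rather than returning a new one.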
Some common functionality of lists:
list.append(x)
: add an element to the end of a list
list_1.extend(list_2)
: add all elements in the second list to the end of the first list
list.insert(index, x)
: insert element x into the list at the specified index. Elements to the right of this index are shifted over
list.pop(index)
: remove the element at the specified position
list.index(x)
: looks through the list to find the specified element, returning its position if it's found, else raises an error
list.count(x)
: counts the number of occurrences of the input element
list.sort()
: sorts the list of items
list.reverse()
: reverses the order of the list
A set is a data structure where all elements are unique. Sets are unordered. In fact, the order of the elements observed when printing a set might change at different points during a program's execution, depending on the state of python's internal representation of the set. Sets are ideal for membership queries, for instance, is a user amongst those users who have received a promotion?
Sets are specified by curly braces, { }, containing one or more comma-separated values. To specify an empty set, use the alternative construct set(), since empty braces, {}, create an empty dictionary.
In [14]:
some_set = {1, 2, 3, 4}
another_set = {4, 5, 6}
empty_set = set()
The easiest way to check for membership in a set is to use the in keyword, checking if a needle is "in" the haystack set.
In [15]:
print 1 in some_set
print 0 in some_set
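Sets also support the usual operations from set theory. A small sketch, using two throwaway sets (the results are sorted before printing because sets have no guaranteed order):

```python
set_a = {1, 2, 3, 4}
set_b = {4, 5, 6}

set_a.add(7)          # set_a now also contains 7
set_a.remove(7)       # and now it does not

difference = set_a - set_b      # in a but not in b
union = set_a | set_b           # in a or b
intersection = set_a & set_b    # in both a and b
symmetric = set_a ^ set_b       # in a or b but not both

print(sorted(difference))
print(sorted(union))
print(sorted(intersection))
print(sorted(symmetric))
```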
Some other common set functionality:
set_a.add(x)
: add an element to a set
set_a.remove(x)
: remove an element from a set
set_a - set_b
: elements in a but not in b
set_a | set_b
: elements in a or b
set_a & set_b
: elements in both a and b
set_a ^ set_b
: elements in a or b but not both
Dictionaries, sometimes called dicts, maps, or, rarely, hashes, are data structures containing key-value pairs. Dictionaries have a set of unique keys and are used to retrieve the value information associated with these keys. For instance, a dictionary might be used to store for each user, that user's location, or for a product id, the description associated with that product. Lookup into a dictionary is very efficient, which is one reason dictionaries are so frequently used and encountered in practice.
Dictionaries are specified by curly braces, { }, containing zero or more comma-separated key-value pairs, where the keys and values are separated by a colon, :. Like a list, values for a particular key are retrieved by passing the query key into square brackets.
In [16]:
a_dict = {"a":1, "b":2, "c":3}
another_dict = {"c":5, "d":6}
empty_dict = {}
print a_dict["b"]
Like the set, the easiest way to check if a particular key is in a map is through the in keyword:
In [17]:
print "a" in a_dict
print "b" in another_dict
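Dictionaries can be updated after they are created. The sketch below uses a hypothetical user-to-location dictionary, echoing the example in the text. The keys and values are sorted before printing so the output does not depend on the dictionary's internal order:

```python
locations = {"alice": "NYC", "bob": "LA"}

# Assigning to a new key adds a key-value pair.
locations["carol"] = "SF"

names = sorted(locations.keys())
cities = sorted(locations.values())
print(names)
print(cities)

# pop removes a key and hands back its value.
removed_city = locations.pop("bob")
print(removed_city)
print("bob" in locations)
```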
Some common operations on dictionaries:
dict.keys()
: returns a list containing the keys of a dictionary
dict.values()
: returns a list containing the values in a dictionary
dict.pop(x)
: removes the key and its associated value from the dictionary
There are many opportunities to combine data types in python. Lists can be populated by arbitrary data structures. Similarly, you can use any type as the value in a dictionary. However, the elements of sets, and the keys of dictionaries need to have some special properties that allow the mechanics of the data structure to determine how to store the element.
Aside: to use a particular element in a set or as a key in a dictionary, it must define a hash function, __hash__. In a nutshell, a hash function maps a data element to a number in a predefined range, based on the characteristics of that element. Mutable data structures such as lists cannot be used this way: because their contents might change, so too would the value of their associated __hash__ function, causing problems for the algorithms powering sets and dictionaries.
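You can see the hashing requirement in action with a quick experiment. This sketch assumes nothing beyond built-in types: a tuple (an immutable sequence, not otherwise covered in these notes) is accepted, while a list is rejected:

```python
# Tuples are immutable, so they can be set elements and dictionary keys.
point = (1, 2)
visited = {point}
labels = {point: "start"}
print(point in visited)
print(labels[point])

# Lists are mutable, so python refuses to hash them.
list_rejected = False
try:
    bad_set = {[1, 2]}
except TypeError as err:
    list_rejected = True
    print("cannot put a list in a set: %s" % err)
```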
In [18]:
print "lists of lists"
lol = [[1, 2, 3], [4, 5, 6]]
lol_2 = [[4, 5, 6], [7, 8, 9]]
print lol
print "lists of lists of lists"
lolol = [lol, lol_2]
print lolol
print "retrieving data from this data structure"
print lolol[0]
print lolol[0][0]
print lolol[0][0][0]
print "data structures as values in a dictionary"
dlol = {"lol":lol, "lol_2":lol_2}
print dlol
print "retrieving data from this dictionary"
print dlol["lol"]
print dlol["lol"][0]
print dlol["lol"][0][0]
We've spent some time going into detail about some of the data types and structures available in python. It's now time to talk about how to navigate through some of this data, and use data to make decisions. Traversing over data and making decisions based upon data are a common aspect of every programming language, known as control flow. Python provides a rich control flow, with a lot of conveniences for the power users. Here, we're just going to talk about the basics; to learn more, please consult the documentation.
A common theme throughout this discussion of control structures is the notion of a "block of code." Blocks of code are demarcated by a specific level of indentation, and are typically introduced by a control structure line that ends in a colon, :. We'll see examples below.
Finally, note that control structures can be nested arbitrarily, depending on the tasks you're trying to accomplish.
If statements are perhaps the most widely used of all control structures. An if statement consists of a code block and an argument. The if statement evaluates the boolean value of its argument, executing the code block if that argument is true.
In [19]:
if True:
    print "duh"
if 1+1 == 2:
    print "easy"
if 2+2 == 5:
    print "really?"
items = {1, 2, 3}
if 2 in items:
    print "found it!"
Each argument in the above if statements is a boolean expression. Often you want to have alternatives, blocks of code that get evaluated in the event that the argument to an if statement is false. This is where elif (else if) and else come in.
An elif is evaluated if all preceding if or elif arguments have evaluated to false. The else statement is the last resort, assigning the code that gets executed if no if or elif above it is true. These statements are optional: an if may be followed by any number of elif blocks and then at most one else, which must come last. At most one code block in the chain is executed. An else will always have its code executed if nothing above it is true.
In [20]:
if 1+2 == 2:
    print "whoa"
    x = 5+1
    print "done"
elif 1+1 == 0:
    print "that explains it"
elif 5+5 == 9:
    print "something"
    if "something":
        print "hi"
else:
    print "what I expected"
x = {1,2,3}
if 5 in x:
    print "found it"
else:
    print "didn't find it"
    x.add(5)
print x
if False:
    print "shouldn't happen"
elif False:
    print "shouldn't happen either"
for statements are a convenient way to iterate through the values contained in a data structure. Going through the elements in a data structure one at a time, each element is assigned to a variable. The code block associated with the for statement (or for loop) is then evaluated with this value.
In [21]:
# note: the variables are not named set, list, or dict,
# since that would hide python's built-ins of the same names
number_set = {1, 2, 3, 4}
for foobar in number_set:
    print foobar
print "a more complex block"
for num in number_set:
    if num >= 3:
        print num+5
print "this also works for lists"
number_list = [1,2,3]
for num in number_list:
    if num >= 2:
        print num+5
print "dictionaries let you iterate through keys, values, or both"
letter_dict = {"a":1, "b":2}
for k in letter_dict.keys():
    value = letter_dict[k]
    print k
for v in letter_dict.values():
    print v
for k,v in letter_dict.iteritems():
    if v == letter_dict[k]:
        print "whew!"
In [22]:
x = [1,2,3,4,5]
for num in x:
    print num
    if num > 2:
        break # leave the loop entirely
y = ["a", "b", "c", "d"]
for letter in y:
    if letter == "a":
        continue # skip to the next element
    print letter
In [10]:
list_of_all_custs = []
custs_with_purch = []
for cust in list_of_all_custs:
    if cust.has_purchase():
        custs_with_purch.append(cust)
    if len(custs_with_purch) > 100:
        break
In [23]:
print range(3) # start at zero, < the specified ceiling value
print range(-5, 5) #from the left value, < right value
print range(-5, 5, 2) #from the left value, to the middle value, incrementing by the right value
for x in range(-5, 5):
    if x > 0:
        print "%d is positive" % x
Functions assign a name to a block of code the way variables assign names to bits of data. This seemingly benign naming of things is incredibly powerful, allowing one to reuse common functionality over and over. Well-tested functions form building blocks for large, complex systems. As you progress through python, you'll find yourself using powerful functions defined in some of python's vast libraries of code.
Function definitions begin with the def keyword, followed by the name you wish to assign to a function. Following this name are parentheses, ( ), containing zero or more variable names, those values that are passed into the function. There is then a colon, followed by a code block defining the actions of the function:
In [24]:
def print_hi():
    print "hi!"
def hi_you(name):
    print "hi %s!" % name
def square(num):
    squared = num*num
    return squared
print_hi()
hi_you("josh")
print square(100)
Note that the function square has a special keyword, return. The argument to return is passed back to whatever piece of code is calling the function. In this case, that is the square of the number that was input.
Variables set inside of functions are said to be scoped to those functions: changes, including any new variables created, are only accessible while in the function code block (with some exceptions). If an "outside" variable name is assigned to inside a function's context, python creates a new local variable with that name, and the outside variable is left untouched.
Similarly, reassigning a function's argument isn't reflected once the scope is returned; the variable will continue to point to the original thing. However, it is possible to modify the thing that is passed, assuming that it is mutable.
In [12]:
# inside a function's context, changes to a variable defined outside that
# context aren't reflected once the context is returned
name = "josh"
def do_something():
    name = "not josh"
    print "something!"
do_something()
print name
# but outside variables can be read!
def do_something_else():
    print name
do_something_else()
def do_something_new(some_name):
    some_name = "nothing"
    print some_name
do_something_new(name)
print name
# mutable objects can be modified
a_list = [1,2,3]
def add_sum(some_list):
    s = sum(some_list)
    some_list.append(s)
    some_list = [] # rebinding the local name doesn't affect a_list
    return s
tot = add_sum(a_list)
print tot
print a_list
# try again!
tot = add_sum(a_list)
print tot
print a_list
In [26]:
# variables created in a function aren't accessible
# outside that function's context
def do_something_new():
    thing = "123"
    print "Hi!"
do_something_new()
print thing # raises a NameError: thing only exists inside the function
In [1]:
def times_two(input):
    input = 2*input
    return input
four = 4
print times_two(four)
print four
You'll often be reading data from a file, or writing the output of your python scripts back into a file. Python makes this very easy. You need to open a file in the appropriate mode, using the open function, then you can read or write to accomplish your task. The open function takes two arguments, the name of the file, and the mode. The mode is a single-letter string that specifies if you're going to be reading from a file, writing to a file, or appending to the end of an existing file. The function returns a file object that performs the various tasks you'll be performing: a_file = open(filename, mode). The modes are:
'r'
: open a file for reading
'w'
: open a file for writing. Caution: this will overwrite any previously existing file
'a'
: append. Write to the end of a file.
When reading, you typically want to iterate through the lines in a file using a for loop, as above. Some other common methods for dealing with files are:
file.read()
: read the entire contents of a file into a string
file.readline()
: read one line of a file
file.write(some_string)
: writes to the file; note this doesn't automatically include any new lines. Also note that sometimes writes are buffered: python will wait until you have several writes pending, and perform them all at once
file.flush()
: write out any buffered writes
file.close()
: close the open file. This will free up some computer resources occupied by keeping a file open.
file.seek(position)
: moves to a specific position within a file. Note that position is specified in bytes.
Here is an example using files:
In [13]:
# note: the variables are not named file, list, or set,
# since that would hide python's built-ins of the same names
out_file = open("temp.txt", "w")
letters = ["a", "b", "c", "d"]
numbers = {1, 2, 3, 4}
for x in letters:
    out_file.write("letter: %s\n" % x)
for n in numbers:
    out_file.write("number: %d\n" % n)
out_file.flush()
out_file.close()
file_2 = open("temp.txt", "r")
for line in file_2:
    print line # note that this doesn't strip off the newlines
file_2.close()
file_3 = open("temp.txt", "r")
print file_3.read()
file_3.close()
# filter rows
file_4 = open("temp.txt", "r")
for line in file_4:
    if line.count("a") > 0:
        continue
    print line.strip() # remove the extra newline.
file_4.close()
# filter columns
file_5 = open("temp.txt", "r")
for line in file_5:
    columns = line.strip().split(" ")
    if columns[1] != "b":
        print columns # prints out the list
        print " ".join(columns) # prints it out as a string
file_5.close()
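It's easy to forget a close() call. A common alternative, not used in the cell above, is python's with statement, which closes the file for you when the indented block ends. The sketch below is only an illustration: the file name temp_with.txt is made up, and the file is written to a temporary directory so it doesn't clobber anything.

```python
import os
import tempfile

# A throwaway location for the example file.
path = os.path.join(tempfile.mkdtemp(), "temp_with.txt")

# The file is closed automatically at the end of each with block.
with open(path, "w") as out_file:
    for letter in ["a", "b", "c"]:
        out_file.write("letter: %s\n" % letter)

# Append mode adds to the end instead of overwriting.
with open(path, "a") as out_file:
    out_file.write("letter: d\n")

lines = []
with open(path, "r") as in_file:
    for line in in_file:
        lines.append(line.strip())
print(lines)
```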
One of the greatest strengths of the python programming language is its rich set of libraries: pre-written code that implements a variety of functionality. For the data scientist, python's libraries (also called "modules") are particularly valuable. With a little bit of research into the details of python's libraries, a lot of common data mining tasks are little more than a function call away. Libraries exist for doing data cleaning, analysis, visualization, machine learning and statistics.
In order to have access to a library's functionality in a block of code, you must first import it. Importing a library tells python that while executing your code, it should not only consider the code and functions that you have written, but code and functions in the libraries that you have imported.
There are several ways to import modules in python; some have better properties than others. Below we see the preferred general way to import modules. In documentation, you may see other ways to import libraries (from a_library import foo). There is no risk in just copying this pattern if it is known to work.
Imagine I want to import a library called some_python_library. This can be done using the import statement. All code below that import statement has access to the library contents.
import some_python_library
: imports the module some_python_library, and creates a reference to that module in the current namespace. In other words, after you've run this statement, you can use some_python_library.name to refer to things defined in module some_python_library.
import some_python_library as plib
: imports the module some_python_library and sets an alias for that library that may be easier to refer to. To refer to a thing defined in the library some_python_library, use plib.name.
In practice you'll see the second pattern used very frequently; pandas referred to as pd, numpy referred to as np, etc.
In [3]:
import math
number = 2
print math.sqrt(number)
In [4]:
import math as m
print m.log(number)
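For comparison, here is what the from ... import form mentioned earlier looks like with the same math module. This is only a sketch of the pattern; the names are used directly, without a math. prefix:

```python
# Pull two specific names out of the math module.
from math import sqrt, pi

root = sqrt(16)
print(root)
print(pi)

# The plain import still works alongside it and refers to the same function.
import math
same_function = math.sqrt is sqrt
print(same_function)
```

The trade-off is that a bare name like sqrt no longer tells the reader which library it came from, which is one reason the import ... as ... pattern is so common.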
Matplotlib is one of the first python libraries a budding data scientist is likely to encounter. Matplotlib is a feature-rich plotting framework, capable of most plots you'll likely need. The interface to the matplotlib module mimics the plotting functionality in Matlab, another language and environment for scientific computing. If you're familiar with Matlab plots, matplotlib will seem very familiar. Even the plots look almost identical.
Here, we'll cover some basic functionality of matplotlib: line and bar plots and histograms. As with most content covered in this course, this is just scratching the surface. For more info, including many great examples, please consult the official matplotlib documentation. A typical pattern for me when plotting things in python is to find an example that closely mirrors what I'm trying to do, copy this, and tweak until I get things right.
Note: to get plots to appear inline in ipython notebooks, you must invoke the "magic function" %matplotlib inline. To have a stand-alone python app plot in a new window, use plt.show().
In most cases, the input to matplotlib plotting functions is arrays of numerical types, floats or integers.
In [5]:
# used to embed plots inside an ipython notebook
%matplotlib inline
import matplotlib.pyplot as plt
# really simple example:
y = [1,2,3,4,5,4,3,2,1]
x = [1,2,3,4,5,6,7,8,9]
plt.plot(x, y)
In [6]:
import numpy as np
X = np.linspace(0, 10, 10000)
Y = []
for x in X:
    y = math.sin(x)
    Y.append(y)
plt.plot(X, Y, 'r-.')
plt.title('The Sine Wave')
plt.xlabel('X')
plt.ylabel('sin(X)')
Notice that most of the functionality in matplotlib that we're using is in the sub-module matplotlib.pyplot.
The third argument in the plot function is a formatting specifier. This defines some properties for a line to be displayed. Some details: Color characters:
b
: blue
k
: black
r
: red
c
: cyan
m
: magenta
y
: yellow
g
: green
w
: white
Some line/marker formatting specifiers:
-
: solid line style
--
: dashed line style
-.
: dash-dot line style
:
: dotted line style
.
: point marker
,
: pixel marker
o
: circle marker
+
: plus marker
x
: x marker
There are many other options for plots that can be specified. See documentation for more info.
It is possible to draw multiple curves on the same axes with a single call. To do this, the Y data passed into the plot function must be a list of lists with the same length as the X data, where each inner list holds one value per curve:
In [7]:
Y = []
for x in X:
    y = [math.sin(x), math.cos(x)]
    Y.append(y)
plt.plot(X, Y)
plt.legend(['sin(x)', 'cos(x)'])
It is also possible to just plot Y data without corresponding X values. In this case, the index in the array is assumed to be X.
In [8]:
plt.plot(Y)
plt.xlabel('index')
plt.ylabel('f(x)')
plt.legend(['sin(x)', 'cos(x)'])
Alternately, multiple calls to plot can be made with differing data. Doing so overlays the subsequent plots, creating the same effect.
In [9]:
Y = []
Z = []
for x in X:
    Y.append(math.sin(x))
    Z.append(math.cos(x))
plt.plot(X, Y, 'b-.')
plt.plot(X, Z, 'r--')
plt.legend(['sin(x)', 'cos(x)'])
Bar plots are often a good way to compare data in categories. This is an easy matter with matplotlib; the interface is almost identical to that used when making line plots.
In [72]:
vals = [7, 6.2, 3, 5, 9]
xval = [1, 2, 3, 4, 5]
plt.bar(xval, vals)
Histograms are extremely useful for analyzing data. Histograms partition numerical data into a discrete number of buckets (called bins), and return the number of values within each bucket. Typically this is displayed as a bar plot.
In [75]:
Y = []
for x in range(0,100000):
    Y.append(np.random.randn())
plt.hist(Y, 50)
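To see what a histogram is counting, it can help to do the binning by hand once. The sketch below is not how matplotlib implements hist; it is a simplified stand-in that uses only the standard library, with equal-width bins between the smallest and largest value. The sample size and bin count are arbitrary choices.

```python
import random

random.seed(0)  # fixed seed so repeated runs use the same samples
values = []
for _ in range(1000):
    values.append(random.gauss(0, 1))

num_bins = 5
low = min(values)
high = max(values)
width = (high - low) / float(num_bins)

# counts[i] holds how many values fall into bin i.
counts = [0] * num_bins
for v in values:
    idx = int((v - low) / width)
    # The largest value sits on the right edge; keep it in the last bin.
    idx = min(idx, num_bins - 1)
    counts[idx] += 1

print(counts)
```

Plotting these counts with plt.bar would give a coarse version of the histogram above.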